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Detecting Variability in Massive Astronomical Time-Series Data I: 
application of an infinite Gaussian mixture model 
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ABSTRACT 

We present a new framework to detect various types of variable objects within massive astro- 
nomical time-series data. Assuming that the dominant population of objects is non-variable, 
we find outliers from this population by using a non-parametric Bayesian clustering algorithm 
based on an infinite Gaussian Mixture Model (GMM) and the Dirichlet Process. The algorithm 
extracts information from a given dataset, which is described by six variability indices. The 
GMM uses those variability indices to recover clusters that are described by six-dimensional 
multivariate Gaussian distributions, allowing our approach to consider the sampling pattern 
of time-series data, systematic biases, the number of data points for each light curve, and pho- 
tometric quality. Using the Northern Sky Variability Survey data, we test our approach and 
prove that the infinite GMM is useful at detecting variable objects, while providing statistical 
inference estimation that suppresses false detection. The proposed approach will be effective 
in the exploration of future surveys such as GAIA, Pan-Starrs, and LSST, which will produce 
massive time-series data. 
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1 INTRODUCTION 

Time-domain astronomy has resulted in a variety of discoveries 
such as gamma-ray bursts and supernovae. These kinds of tran- 
sient phenomenon have made it possible to understand a rare stage 
of stellar evolution. Moreover, variable stars have been key objects 
for investigating stellar population s, the structure of the M ilky Way, 
and the expansion of the universe l lBono & Cignonill2005h . 

Despite its long history and contribution to astronomy, the 
study of variable sources is not complete yet. As .Pac zyiiski ( 2 00oh 
emphasised, there might be unknown variable s ources. Moreover, 
know n variable objects are not well understood tever & Mowlavil 
I2OO8I) . Recently, several su rveys revealed a large number of vari- 
able sources as byproducts ( Paczvnski 2001). Even more new vari- 
able sources are expected to be discovered in future surveys (e.g. 
IWalkeiil2003h . 

A common approach in the study of variable sources consists 
of detection, analysis in the time domain, analys is in the phase do- 
main with period estimation, and classification ( lEveill200 5. 2006). 
For each step, various methods have been proposed and tested in 
several projects. One example is a set of variable stars from the 
MACHO project jCook et al.|[T99^ where variable objects are se- 
lected by using chi-square statistics, and the periods of these ob- 



jects are derived from the method explained by iReimanrJ ( 1 1994) . 
The MACHO project also uses a power spectrum of the time-series 
data to separate ou t a specific kind of a variable star from others 
jAlcock et al.lll99^ . In addition, RR Lyrae have bee n investigated 
with their distinctive colour and absolute m agnitude (lAlcock et al.l 
1 19961) . or visual inspection of light curves jAlcock etal.lll997h in 
the MACHO project. 

Period estimation and classification of variable sources 
have been intensively examined by various methods. Pe- 
riod determination has b een tested for data with diverse 
types of light curves (e^. 'Reimann 'l994'; Akerlof et al.| 1 1994 
ISchwarzenberg-Czemvl [l998: Shin & Bvun 2004). Classifica- 
tion has been explored by using stati stical tools, includ- 



ing machine learning algori 


thms (e.g. lEyer & Blakd 
Belokurov. Evans, & Le Du| 


2002; 


Belokurov. Evans. & DiJ 


2003; 


2004; 


Debosscher et al.l |2007|; 


Willemsen & Everl |2007|; iMahabal et al.1 


2008|). 
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However, the general method of variability detection has not 
been well investigated, and a typical method is usually based on 
a simple probabili ty test that is opt imised for specific variability 
types or data (e.g. ISumi et al]|2005l) . Detection algorithms cannot 
be separated from the factors that determine sampled time-series 
data: variability types, observation cadence, quality cuts of data 
samples, noise patterns, systematic biases, etc. Detecting any type 
of variable object depends on the data we have and how we measure 
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variability. Therefore, variability detection has to be a data-oriented 
process without dependence on assumptions about the given data. 

General variability detection methods must be based on the 
following requirements (see Eyer 2006, for a discussion). First, 
the method has to recover a broad range of variability types. Par- 
ticularly, the detection method needs to be able to recover a new 
type of variability. Second, a probabilistic inference has to be de- 
rived in order to help people estimate detection reliability. As the 
amount of data increases, controlling the detection of a false posi- 
tive becomes important. Third, it is critical for the detection method 
to deal with a variety of data sets such as the number of data 
point s, uncertainties in the measured data , and time-sampling pat- 
terns jCarbonelL Oliver, &Ballesteill992l) . Even in a single survey 
project where one defined cadence is valid, people can adopt dif- 
ferent values for data quality cuts because of varying observing en- 
vironments and different properties of each observation field such 
as precision of photometry. Such differences can result in a hetero- 
geneous distribution of data points. 

In this paper, we propose a new framework to detect a broad 
range of variability within massive time-series data. We employ 
an unsupervised Bayesian machine learning algorithm which uses 
an infinite Gaussian Mixture Model (GMM) with the Dirich- 
let P r ocess (DP) (see iKellv & McKavl |2004 lOebosscher et al.l 
l2007l : iBamford et al.ll2008l for an example of the GMM in as- 
tronomy). In this context, separating variable objects from non- 
variab le ones can be regarded as a clustering problem ( Jain et al. 

or detecting outliers from the cluster of non-variable objects 
dCateni. Colla. & Vannuccill2008l) . 

We adopt six variability indices that are measured from light 
curves in the time domain and used as input features for cluster- 
ing with the infinite GMM. These indices summarise the system- 
atic structure of an individual light curve in the time domain. All of 
the variability indices are estimated by considering the photometric 
uncertainty and number of data points in each light curve. Because 
these indices cover different features of data which are associated 
with variability types, sampling patterns, etc., the GMM encom- 
passes a broad range of variability typ es. Using a combina tion of 
multiple indices has been suggested bv lShin & Bvunl fcoOV ). 

In our approach, the infinite number of component^ which 
are described by multivariate Gaussian distributions represent the 
six-dimensional space spanned by the variability indices. Un- 
like the GMM used in other astronomical research, our method 
is based on the DP which makes our approach non-parametric 
by c onstructing the prior probability from the given datfl (see 
IChatt opadhvav et al. 2007, for an application of the DP in astron- 
omy). The clusters of data points are self-recognised by Bayesian 
reasoning and the DP. Like other unsupervised learning methods, 
this method fully exploits all of the information in the data. 

After the infinite GMM is found for the given data, statistical 
inference measures how convincingly candidates of variable ob- 
jects can be separated out, which helps one quantify the reliability 
of recognising variable sources. The only assumption made by this 
approach is that the largest cluster of the GMM is a cluster of non- 
variable sources. Therefore, the GMM works well when data has 



^ We use component as the same term as cluster and group in this paper. 
^ Even in a non-parametric Bayesian method, a parametrised model is still 
used as in a parametric Bayesian method. The difference is that the para- 
metric method has a fixed number of parameters so that the complexity 
of the model is fixed. Meanwtiile, the number of parameters changes ac- 
cording to the complexity of the given data in the non-para metric method 
iWalker et alJl999l ; lM'uller & Ouintanj|2003l:|jordanl2005l) . 



a dominant cluster of non-variable objects as we generally find in 
astronomical time-series data. 

In this paper, we show how to use the infinite GMM with 
the DP for variability detection. Using six variability indices of 
time-series data from the Northern Sky Variability Survey (NSVS) 
jWozniak et"ai]|2004h . we find the largest cluster that should rep- 
resent a cluster of non- variable sources. The reliability of the non- 
variable cluster is tested for the size and properties of the data. We 
use the identified clusters to separate out variable source candidates 
from the data. 

The paper is organised as follows. In §2, we explain the NSVS 
data, variability indices, and infinite GMM with the DP. The appli- 
cation of the GMM is given for the sample data in §3, showing 
the reliability and stability of the found non-variable cluster that 
is examined for the size and properties of the data. We explain 
how to measure the significance of variability in §4. The discus- 
sion and conclusions are given in the last section. In the appendix, 
we present the basic mathematical explanation of the infinite GMM 
with the DP. 



2 METHOD 
2.1 Test Data 

We use light curves that have more than 15 good photometric data 
points in the NSVS databas^. A systematic search of the various 
kinds of variable sources has not been carried out with the NSVS 
data. But photometric quality control is well understood, and we 
can use the large amount of photometric data that allows us to 
recover new variable sources. We select five NSVS fields (065d, 
087a, 088d, 135b, 135d) that have the largest number of objects in 
those fields. The basic information for those fields is given in Table 
[T|( Wozniak et al. 2004). We call these data set A. As Wozni ak et"al] 
([20040 suggested in their Table 3, we use only good photometric 
data points, which avoid any artifacts from observation and data 
processing, for each object. This photometric quality cut prevents 
any effects from spurious data points. When we limit samples that 
have more than 15 good photometric points, the number of objects 
from each field is about 45,000. The total number of objects is 
227,212. 

As shown in Figure [T] the number of data points and time- 
scale of light curves has a broad range. The number of data points 
in the light curves affects the uncertainties in the variability indices 
that will be explained in the following section. Furthermore, the 
number of data points is decided by a sampling pattern for each 
field as well as the selection of good photometric data. The ex- 
tractable information is also subject to the time scale of the light 
curve. In the case of period ic variability , the Nyquist frequency is 
an important measurement ( lKoerJl2006f) . However, if we consider 
a general type of variability and irregular sampling, it is useful to 
examine the distribution of the maximum time span among data 
points. Only a tiny fraction of the light curves covers a time span of 
about 300 days with more than 100 data points, where a dominant 
fraction of the data has less than 60 data points. 

We also extract set B which has the same number of light 
curves from six different fields of the NSVS as set A. The dif- 
ferences between these two sets arise from the overall number of 
frames being larger in set B than in set A (Table |2l. However, us- 
ing only good photometric data points as we do with set A (see 
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Table 1. NSVS Fields of the set A. 



Name 


Galactic I 


Galactic b 


Number of frames 


Number of objects 


Limiting photometric scatter 


065d 


78.0 


-8.0 


235 


55051 (46925) 


0.030 


087a 


49.0 


10.0 


299 


54749 (47510) 


0.029 


088d 


60.0 


-8.0 


196 


55465 (48155) 


0.030 


135b 


16.0 


-6.0 


106 


55399 (41142) 


0.043 


135d 


27.0 


-9.0 


102 


55039 (43480) 


0.034 



We present the numbers of objects that have more than 15 good data points in the parenthesis. 




120 

Number of data points 



Figure 1. The number of data points and maximum time span for set A. Five test fields show a broad distribution in the number of data points. The maximum 
time span also reveals a broad distribution that does not depend on the number of data points. 



IWozniak et a"i]|2004 Table 3) makes the light curves of set B in- 
cludes less data points than set A. Additionally, set B has fewer 
light curves with a large time-span and many data points as shown 
in Figure |2] 



2.2 Variability indices 

Below we define six variability indices (a/fi. Con, r], J, K, 
AoVM) that are obtained from light curves in the time domain. 
The simplest index of variability is the ratio of the standard devia- 
tion to the sample mean magnitude 



- M)V(^ " 1) 



(1) 



where n is an index over the relevant data points and N is the total 
number of data points in each light curve. When this ratio is large, 
the light curve may have strong variability. We note that this ratio is 
not correspondent to a flux ratio because magnitude is a logarithmic 
unit. 

However, a / ^ does not describe detailed features of variabil- 
ity. Therefore, we find three consecutive points that are at least 2 
a fainter or brighter than the median magnitude in order to trace a 
continuous variation in the data points. The number of consecutive 



series is normalis ed by (A^ — 2), and is called Con . This measure- 
ment was used in IWozniakI ( [2OO0h . 

The systematic structure of the light curves is also quantised 
by the ratio of the mean square successive difference to the sample 
variance r) dvon Neumannlll94ll) : 



(2) 



This von Neumann ratio was suggested to test the independence of 
random variables in successive observations particularly on a sta- 
tionary Gaussian distribution. When there is a strong positive (neg- 
ative) serial correlation between sequential data points, this ratio is 
small (large). In short, if serial cor relation exists, the ratio is signif- 
icantly high or small ( |Panikll2005l) . The distribution of rj has been 
extensively investigated f or a stationary Gaussian distribution (e.g. 
iBingham & Nels on 1981), and its sample average and variance are 
well known jWilliams 1941.) . But the properties of ry are not sim- 
ple for astronomical time-series data because they are irregularly 
sampled and do not follow a simple known distribution such as a 
stationary Gaussian distribution. 

Three additional indices are adopted from concepts that have 
be en developed i n astronomy community. J and K are suggested 
bv lStetsonI jl99d) . We use the following definition that uses only a 
single photometric band: 
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Table 2. NSVS fields of the set B. 



Name 


Galactic I 


Galactic b 


Number of frames 


Number of objects 


Limiting photometric scatter 


045a 


99.0 


-6.0 


304 


54455 (45551) 


0.032 


064a 


66.0 


9.0 


308 


54320 (46392) 


0.028 


089a 


64.0 


-13.0 


289 


54363 (46639) 


0.025 


112a 


43.0 


-9.0 


228 


54334 (43191) 


0.031 


135a 


23.0 


-2.0 


112 


54096 (40601) 


0.037 


157d 


10.0 


-10.0 


61 


54391 (4838) 


0.031 



The numbers in the parenthesis represent objects that have more than 15 good data points. We use a part of the 
data from field 157d. 




116 

Number of data points 



Figure 2. The number of data points and maximum time span for set B. Compared with set A, the light curves of set B have a smaller number of data points. 
Set B also covers a smaller time span than set A. 



J — sign{5nSn+i)\/\3nS„+i\, 



K 



(3) 



(4) 



curve is real, and the star might be a long-period variable star. The 
light curve with the minimum value of r; corresponds to a known 
variable star BS Her (Nassau & Stephenson 1961]). As we expect, 
the positive serial correlation in magnitude has a small -q in this 
light curve. 



where S„ — ^ N/(N — l){xn — fJ-)/en which has a pho- 
tometric error for each data point e„ and sign{S„5n+i) is the 
sign of 5nS„+i. Finally, we measure the analysis of variance 
(AN OVA) statistic which is useful for discovering periodic sig- 
nals dSchwarzenberg-Czemvl [l99^ . The maximum value of the 
ANOVA represented by AoVM is used to measure the strength of 
periodicity. Even though the corresponding period can be incor- 
rect, the AoVM is s till a valuable quantity that infers periodicity 
dShin & Bvurj|2007h . 

Figure [3] shows light curves with variability indices that have 
the minimum, median, and maximum values across all of the 
227,212 light curves in set A. None of the light curves occur more 
than once in Figure[3] These examples prove that different variabil- 
ity indices catch different features of lig ht curves. The light curv e of 
the infrared source IRAS 18402-1742 faelou & Wa"ik^ll988h has 
the largest value of Con. We suspect that the variation of the light 



The variability indices complement each other by picking up 
different features of the light curves. As shown in Figure |4l even 
though we notice some structure in the distributions for each two- 
dimensional projection of the original six-dimensional space, the 
indices do not have a strong correlation with each other. If a dom- 
inant fraction of light curves is simply from a Normal distribu- 
tion, we would see only one simple structure in all plots. Since 
light curves of non-variable objects are not random samples from a 
Normal distribution, each plot shows more complicated structures 
which imply the existence of variable objects. Any structures will 
be defined as a separate cluster by the GMM. However, the strong 
concentration of data in each plot implies the existence of a domi- 
nant cluster of non- variable objects. Additionally, combining multi- 
ple indices helps us suppress the false detection of variable sources 
while not missing any possible features in the variability. 
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Figure 3. Light curves with minimum, median, and maximum parameter values. The left three columns present light curves with minimum, median, maximum 
values of cr/ fi. Con, and 7]. The right three columns are the same light curves but with J, K, and AoYM. Even am ong these examples, we recognise a 
light curve of a variable star with the smallest r}, which is a known variable stai" BS Her ^Nassau & Stet)hensoj|l96ll) . because of its monotonic increasing 
magnitude. The lig ht curve with the lai'ge st value of Con shows a systematic variation that may be a variable star which con'esponds to the infrared source 
IRAS 18402-1742 faelou & Waike3ll988h . 



2.3 GMM 

In the infinite GMM based on the DP, the distribution of mixture 
component members is described by a multivariate Gaussian distri- 
bution while the distribution of all objects is described by a mixture 
of Gaussian distributions defined by the stochastic DP. Each of the 
M component distributions has the following form: 

1 1 T -1 

^'"(^^ = (27r)-r/2|S |i/2 ^^P(~2^^~^"'^ (x-^^)),(5) 

where m is an index over AI, x = Con, 77, J, K, AoVM) 

is a 6-dim vector of parameters, and 7 is the number of parame- 
ters (in our case 7 = 6). Furthermore, fim is a 6-dim vector of 
mean values (i.e., mixture centres), and E^n is the covariance ma- 
trix of the Gaussian distribution associated with the mth mixture 
component. The problem is how to find a weighting for each mix- 
ture component Wm and its respective fi^n and such that the 
final distribution of all objects is given by: 

AI 

p(x) = ^pm(x)u?m. (6) 
m — 1 

The DP is used to estimate Wm, fJ-m, and Em and is explained in 
Appendix A. 

When loading a given data set, one also specifies initial val- 
ues for hyper-parameters used in the clustering algorithm. These 
hyper-parameters include the number of iterations to be taken by 
the algorithm to ensure convergence to a stable model, number of 
initial mixture components AI to which data is assigned, and con- 
centration a which can be thought of as the inverse variance of the 
DP. In all of the models presented in this paper, the number of it- 
erations is 100 even though convergence can be seen in as little as 
10 iterations, M = 60 initially, and a = 1. Convergence is de- 
fined by M reaching some consistent value despite the algorithm 
continuing to iterate. To ensure that the clustering algorithm iden- 



tifies all relevant features (i.e., number of mixture components), it 
is possible to initialise the algorithm with M — N, (i.e. the num- 
ber of data points) which is the highest possible complexity. Since 
we are mainly concerned with identifying one central cluster (i.e., 
non-variable objects), this computationally expensive procedure is 
unwarranted. With the data set and hyper-parameters loaded, the al- 
gorithm first creates an empty Gaussian distribution Go of mixture 
components M with a conjugate Gaussian- Wishart prior such that 
the mean vector is drawn from a Gaussian distribution and preci- 
sion matrix (i.e., the inverse covariance matrix E^^) is drawn from 
a Wishart distribution. Second, the algorithm randomly initialises 
the mixture component assignments z = [i?jvA/], where _Rjv is 
a A'^-dim vector with entries that are uniformly distributed. By us- 
ing a. Go, and z, data points x are added to the mixture compo- 
nents. As the algorithm iterates to convergence, this initial assign- 
ment matters little because conditional probabilities are computed 
for each data point with respect to each of the M active mixture 
components. Lastly, a collapsed Gibbs sampler runs for the spec- 
ified number of iterations while also iterating over A'^. We imple- 
ment this algorithm by using matlaeQ. 



3 THE LARGEST CLUSTER AS NON-VARIABLE 
OBJECTS 

Since the largest cluster must represent non-variable light curves, 
and our GMM with the DP is a data-driven unsupervised machine 
learning algorithm, the properties of the largest Gaussian mixture 
must be dependent on the input data. Therefore, we examine the 
dependence of results on the size and properties of the input data. 



MATLAB is a registered trademark of The Mathworks. 
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Figure 4. Distribution of variability indices for the set A. Each variability index describes a different feature of light curves in time domain. But we find the 
existence of the strong concentration of data in this six dimensions of the variability indices. It implies that the GMM can definitely find a dominant cluster of 
non-variable sources, while also separating outliers from the dominant cluster as separate minor clusters. 



3.1 Results of set A 

The GMM of set A is composed of 29 mixture components where 
one cluster dominates the data. As shown in Figure |5] the num- 
ber of mixture components quickly converges to about 29 after 10 
iterations. Moreover, the centre of the dominant cluster remains sta- 
tionary. The largest cluster is populated by 76.2% of the input data, 
and describes non-variable objects. The second and third largest 
clusters include only 7.7% and 5.6% of the data, respectively. 

The centre of the dominant cluster is. {a / ^ , Con , rj , J , K , 
AoVM) = (7.45 X 10-^ 3.90 x 10"^, 1.70, 8.29, 7.52 x 10~\ 
8.16), which also becomes stationary when the number of clusters 
converges after the 10 iterations. The covariance of the multivari- 
ate Gaussian model for the largest group is used for statistical in- 
ference to select candidates that may be variable sources and will 
be explained in §4. Measuring the ratio between the covariance of 
each variability index for the largest cluster and that for the whole 
data of set A, we find that the ratio of r] is 0.71 which is highest 
among the six variability indices. Meanwhile, the ratio of Con is 



lowest and close to zero, suggesting that this variability index has 
less powerful than others in separating out non-variable objects. 

3.2 Dependence on the size of data 

The dependence of the largest cluster on the size of the input data is 
tested using samples of set A. We randomly select 10%, 30%, and 
50% of the light curves from set A. We compute a GMM for these 
sub-samples using the same setup that was used for the main test. 
Following the basic assumption that the largest cluster represents 
non- variable objects, we identify the largest cluster in the three sub- 
sets. The GMM for the 50% sample should be closer to the GMM 
for all of set A than the GMMs for 10% and 30% samples. 

The six variability indices show dependencies on the size of 
the input data. In Figure[6] the GMM recovers more clusters as the 
size of dataset increases. Because a larger dataset can include more 
features of data, the GMM finds more separable clusters. This re- 
sult is consistent with our expectation for an infinite GMM based 
on the DP which identifies previously unseen structure as the data 
set with observable features increases in size. Although each vari- 
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Table 3. Changes of the largest group in the 10%, 30%, and 50% samples. 



Fraction of data 


Included non- variables 


Recovered non-variables 


Included variables 


Missed non-variables 


10% 


17368 


17348 (99.9) 


1082 (6.2) 


20 (0.1) 


30% 


52043 


51860 (99.6) 


2856 (5.5) 


183 (0.4) 


50% 


86818 


86070(99.1) 


2696 (3.1) 


748 (0.9) 



The numbers in parenthesises show the fraction in percentage with respect to the total number of non-variable 
members (i.e. the second column) that were included in the largest group associated with the original dataset and 
that are also included in the largest cluster associated with the subsamples. 
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Figure 5. The change in cluster properties for set A as a function of the 
number of iterations, (a) The total number of identified clusters converges 
to 29 after 10 iterations, (b) The number of data in the largest cluster reaches 
82%, but converges to 76% after 100 iterations, (c - h) The center of the 
largest cluster does not show a significant change after the 10th iteration. 
It means that the small changes in the number of clusters after the first 10 
iterations does not affect the Gaussian model of the largest cluster. 



ability index responds differently to the size of data, all indices con- 
verge more quickly with less data. But the result for the 50% subset 
shows the better convergence of the indices to the original values 
than other subsets. 

However, more iterations do not make it possible to recover 
more clusters. When we use set A, we find 29 clusters with 100 
iterations, and the number of clusters quickly converges to 29 af- 
ter 10 iterations. But a smaller data produces less clusters more 
quickly. Unsupervised learning techniques naturally handle vari- 
ation in the size of dataset which correspond to variation in the 
amount of available information. Therefore, the maximum number 
of recovered clusters is not dependent on how many times the clus- 
tering procedure iterates. 

We test the stability of the largest group derived with the orig- 
inal data by checking the membership of the largest groups with 
10%, 30%, and 50% samples. For example, 10% subsamples have 
17368 data points which were included in the largest group as 
shown in Table [3] Among those data points, 99.9% of them are 
recovered in the largest group with 10% subsamples, while 1082 
objects are newly included in the largest group. However, 20 ob- 
jects are now enclosed in minor groups with 10% subsamples. In 
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Figure 6. The change in cluster properties for randomly selected 10%, 30%, 
and 50% samples of set A. In each plot, the cross symbol corresponds to the 
converged values given in Figure|5] (a) The total number of identified clus- 
ters is smaller than that of the original dataset. The subsets are represented 
as dotted, dashed, and solid lines for 10%, 30%, and 50%, respectively, (b) 
The largest cluster in the samples has a higher percentage of the data than 
the largest cluster in the original dataset. (c - h) For all six variability in- 
dices, the result for the 50% subset is most close to what we find for the 
entire data. 



both 30% and 50% subsamples, the members of the original largest 
group are well recovered with higher than 99% rate. The clustering 
result is mainly affected by new objects which were not included in 
the largest group associated with the original data, but are included 
into the largest group associated with the sub-samples. These data 
points are mainly from the edge of the largest group in the origi- 
nal data as shown in Figure |7] The definition of the Mahalanobis 
distance Dm is 



Dm = V(x-Mo)TSo^(x 



Ho) 



(7) 



where the centre /lo and covariance matrix Eq of the largest cluster 
with the original data are used with the position of an individual 
object X in six-dimensional space. Simply, Dm corresponds to the 
exponent of the multivariate Gaussian distribution (see Equation 
O. Therefore, a high value of Dm represents a distant object from 
the centre of the largest group. Figure [7] shows that contamination 
related to the largest group is dominated by objects around the edge 
of the original largest group. 
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Figure 7. Distribution of the Malialanobis distances from the largest group 
with the original data for 10%, 30%, and 50% subsamples. In each plot, the 
distribution for all data points (i.e. the original data) is represented by dotted 
lines, while solid Hnes are the distributions of objects which are included in 
the largest group associated with both the sub-samples and the original data. 
Objects newly included in the largest group associated with the sub-samples 
(dashed line) and excluded from that (shaded bar) are mainly from the edge 
of the original largest group. The distribution represents dN/dlogioDM 
instead of dN/ dD^ . 



3.3 Dependence on the noise in data 

In order to test the effects of noise on the clustering results, we 
modify the original data set A by adding extra dispersions to the 
raw light curves. If the magnitude distribution in the raw light 
cvu^es is simply described by the Normal distribution N{^,a^) 
with mean fj. and dispersion cr^, we can increase the dispersion of 
the light curve by adding the random number from the Normal dis- 
tribution N{Q, ul^^ci) to the raw light curve, because the sum of two 
Normal distribution variables also follow Normal distribution: 



U = X + Y ^ N{iJLx + IJ.Y ,(j'x + oy), 



(8) 



where X ^ N{fix, f7"x) ^ ~ ^{fJ-Y, Here, we do not 
change the time sequence of the raw light curve, and use the disper- 
sion of the raw light curve to generate the added term with the three 
cases ofaf^j = O.lcr^, O.Sct'^, and O.Su'^. These values correspond 
to 10%, 30%, and 50% increases in dispersions, respectively. Fig- 
ure [8] shows one example light curve which is a member of the 
largest group associated with the original data. 

We warn that our approach to degrade the data can be quite 
different from realistic cases. First, there is no guaranty of assum- 
ing a Normal distribution for the raw light curves. Second, even 
when raw light curves follow a Normal distribution, the rule given 
in Equation [8] is not well implemented when light curves have a 
small number of data points. Third, if the raw light curve has in- 
trinsic variability which might result in a large dispersion, using 
the dispersion from the raw light curve in Equation [8] can cause 
systematically biased effects on the light curves of truly variable 
objects. Because of these reasons, the increase in dispersion can 
deviate from the expected change in Equation[8]as shown in Figure 
[8] Our simulation also fails to reproduce red noise if the data have. 



Figure 8. Example light curves with increased dispersions. From top to 
bottom, each panel shows the raw light curve and light curves with 10%, 
30%, and 50% increased dispersions, respectively. The dispersion of the 
raw light curve is increased by adding random values that sampled from a 
Normal distribution to the existing raw data points. 
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Figure 9. The change in cluster properties for set A with 10%, 30%, and 
50% increased dispersions. In each plot, the cross symbol corresponds to 
the converged values given in Figure |5] and dotted, dashed, and solid lines 
represent 10%, 30%, and 50% increased dispersions, respectively, (a) The 
total number of identified clusters is higher than that of the original dataset. 
(b) The fraction of data in the lai'gest cluster decreases substantially com- 
pared with the largest cluster associated with the original data, (c - h) Six 
variability indices have different sensitivities to the change in magnitude 
dispersion, implying that variabiUty indexes are important to clustering. 
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Table 4. Changes of the largest group in the samples with 10%, 30%, and 50% increased dispersions. 



Increased dispersions 


Recovered non-variables 


New non-variables 


Excluded non- variables 


10% 


146170 (84.4) 


1808 (1.0) 


27017 (15.6) 


30% 


141332 (81.6) 


3724 (2.2) 


31855 (18.4) 


50% 


143689 (83.0) 


6998 (4.0) 


29498 (17.0) 



The numbers in parenthesises show the fraction in percentage with respect to the total number of members that 
were included in the largest group associated with the original data. 




Figure 10. Distribution of the Mahalanobis distances from the original 
largest group for the samples with 10%, 30%, and 50% increased disper- 
sions. The distribution for the original data is described by dotted lines. For 
the members of the original largest group, parts of them are still included in 
the largest group with the noise data (solid line; the second column in Table 
|4) even after they were altered by added dispersions. But as the dispersion 
increases, more objects (shaded histogram; the third column in Table |4) 
are newly included in the largest group with the noise data, while some 
members of the largest group with the oiiginal data (dashed line; the fourth 
column in Table|4) are now excluded from the new largest group. 



Even though this test might not be reahstic, unsupervised learning 
intrinsically lacks a way to study noise effects without providing 
completely artificial data. 

Figure|9]summarises the effects of noise on clustering and the 
largest group. The increase in noise enhances dispersions among 
clusters that were found with the original data, and results in the 
recovery of more clusters because the largest group is populated by 
fewer objects. Importantly, variability indices associated with the 
centre of the largest group responds to the effects of noise in differ- 
ent ways. Therefore, the change of the cluster centre for the largest 
group is not a simple function of the dispersion change although 
the data with the low noise generally converges to the results for 
the original data. 

We also trace which objects are included in the newly found 
largest cluster. The added dispersion naturally boosts mixing be- 
tween the original largest group and other minor groups. As pre- 
sented in Table|4l about 16% - 18% of objects that were included 
in the largest group associated with the original data are found in 
minor groups with the increased dispersions. Meanwhile, the ad- 



Figure 11. Distribution of the Mahalanobis distances from the new largest 
group in the samples with the increased dispersions. The solid line shows 
the distribution of all objects with respect to the new largest group. The 
dashed line and shaded histogram represent the same objects as explained 
in Figure[To] The top two largest clusters are shown by dotted lines for each 
case. 

dition of new objects to the largest group is a small percentage. 
Figures [To] and [TT] show that the increased dispersions induce the 
objects around the edge of the original largest cluster to move from 
the largest cluster in the new clustering. Figure [TTI demonstrates 
that this effect mainly results in grouping objects into the second 
largest cluster in the new data. 

3.4 Dependence on the source of data: results of set B 

We test the dependence of the GMM on the properties of a partic- 
ular dataset by applying our method to set B. As described in §2.1, 
set B has different properties of data. The largest cluster of set B is 
populated by 83.1% of data, while we find that the largest cluster 
of set A is populated by 76.2% of the data (see Figure [T2l(. The 
number of recovered clusters is 26 which is smaller than that of set 
A. Even though fewer clusters are identified in set B, the largest 
cluster in set B describes more of the data. 

Figure [T2] shows how each variability index changes based on 
the input data. In this test, J is a signature of a large change that 
depends on the properties of the data. We note that K or AoVM 
is a variability index that shows the largest change for the noisy 
data (see Figure The centre of the largest cluster in set B is 
(cr/^i , Con , rj , J , K , AoVM ) = (6.84 x 10~^ 3.45 x 10"^ 
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Figure 12. The change in cluster properties for set B. The plotted fields 
are the same as those in Fieure [5] We find some difference in the largest 
cluster between sets A and B as we expect in data-driven machine learning. 
In particular, J shows the most significant difference. 



1.68, 6.41 X 10"^, 7.52 x 10"\ 7.75). This result implies that our 
method has to be applied to a single dataset that shares common 
properties. This requirement is often necessary for data-oriented 
machine learning methods. Compared to the test associated with 
increasing the dispersion of light curves, the experiment with set 
B is more realistic in proving the data-dependence of unsupervised 
machine learning algorithms. 



4 SEPARATION OF VARIABLE OBJECTS 

After we identify the largest cluster as the aggregation of non- 
variable objects, the next question is how to separate out candidates 
of variable objects. Naively, we can accept the clustering results of 
the GMM as a guide line for the separation. But simply depending 
on clustering results is not satisfactory for two reasons. First, any 
systematic spurious patterns can be clustered as the second or third 
largest cluster, as seen in Figure [TT] Second, some clusters can be 
close to the largest cluster in six-dimensional space, implying that 
the separation of other clusters from the largest cluster might not 
be meaningful. Therefore, if the clustering result is used to define 
candidates of variable objects, then various classical methods for 
multivariate analysis can be applied (Krzanowski 1988) in addition 
to the cluster membership from the GMM with the DP. We suggest 
two simple ways to use the results from our GMM method with the 
clustering membership. 



4.1 Inference from Mahalanobis distances 

The first approach uses the Mahalanobis distance to gauge how far 
an object is from the largest cluster. For our application, the Maha- 
lanobis distance is more useful than a multidimensional n orm be- 
cause it includes the effects from the dispersion of the data jBishod 
l2006h . 

The distribution of Dm shown in Figure[T3lindicates that a cut 





4 



Figure 13. Mahalanobis distance of all objects in set A. The distribution of 
all objects (thick solid line) has a peak around Dm ^ 2 which corresponds 
to the distribution of only the largest cluster (thin solid line). The second 
and third largest clusters (dotted lines) are distributed closely to the largest 
cluster, even though they are identified separately by the GMM with the DP. 



based on Dm can be used to identify variable objects. This distri- 
bution has a concentration of objects around Dm ~ 2. The position 
of this peak matches the mode value of the Beta distribution which 
is exp ected for the distribution of Djv/ ( Ververidis & Kotropoul(3 
I2OO8I) . This distance also represents the typical distance of ob- 
jects that are included in the cluster of non-variable objects (i.e. 
the largest cluster). Furthermore, this distribution confirms that the 
second and third largest clusters may not represent real variable 
objects because the members of the clusters are close to the largest 
cluster. 

Even though Dm is inexpensive to compute, it does not give 
direct statistical inference nor provide a statistical confidence limit 
on our belief that an object is variable. Dm is simply an exponent 
of the multivariate Gaussian distribution (see Equation|7]l. One has 
to find an empirical cut for Dm that separates variable and non- 
variable objects. 

4.2 Confidence bounds 

The way to extract a direct statistical inference is to derive confi- 
dence bounds for non- variable objects with Dm- With the identi- 
fied centre /io and covariance matrix Eo of the largest cluster, we 
define a confidence bound of 1006% (0 < b < 1) which encom- 
passes 100&% of non-variable objects (seel Chen. Morris. & Martiiil 
.2006.. for an example). The confidence bound is described as a like- 
lihood threshold h that is associated with the probability b: 



p(x|/.(o,So)dx = b, (9) 

/x:p(x)>h 

where p(x) is a multivariate Gaussian distribution defined by /lo 
and Eo. From this integration, we can estimate a confidence limit 
that corresponds to a specific Dm for h. But despite its statistical 
robustness, this integration is practically difficult and expensive to 
compute because it cannot be calculated analytically. 
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We use a Monte Carlo method to find the confidence hmit 
in Equation [9] An approximate cut is simply the value of Dm that 
includes lOOfe % of the data in the largest cluster, when sorting Dm 
in an ascending order, i.e. a descending order of p. However, a more 
precise estimate is made possible by generating multiple samples of 
the data that pop ulate the largest cluster and find ing a limit for Dm 
in each sample jChen. Morris, & Martini l2006h . In Figure [T3] we 
can guess that Dm — 4.67 for set A where an approximate cut of 
99% is assumed. But we find Dm = 4.68, when using the Monte 
Carlo method by sampling 50 times with 2000 samples for each 
sampling. Because the largest group includes a large number of 
data points, the simple approximation is close to the estimate given 
by the Monte Carlo method. When we choose a 90% cut in Dm to 
define variable source candidates, the total number of candidates is 
50,394 for set A and corresponds to about 22% of the light curves. 
But we find that about 9% and 29% of objects in the second and 
third largest group are within the 99% cut of the largest group (i.e. 
Dm = 4.68). 

4.3 Examples of light curves for each cluster 

Our method provides two pieces of information to help people se- 
lect variable source candidates. First, objects are chosen as candi- 
dates when they are not included in the largest cluster. This idea 
corresponds to a classical method of outlier detection which uses 
clustering. Second, we use the cut D]\f in addition to the result of 
clustering. As shown in Figure[T3] the second and third largest clus- 
ters are overlapped with the largest cluster in the six-dimensional 
space. This se cond approach is relev ant to the distance-based out- 
lier detection dCateni, Colla, & Vann ucci 2008). The most useful 
approach is to employ both the clustering results and the statistical 
cut in Dai . If we conclude that the second and third largest clusters 
are explained by a systematic bias in the data, then we can exclude 
the second and third largest clusters when selecting variable source 
candidates. Furthermore, Dm can be used to assign a priority to the 
candidates. 

Figure [14] presents an example of light curves for clusters I, 
2, 3, and 4 in set A. Here, we randomly select two light curves for 
each cluster. Cluster I is the fourth largest cluster in the GMM for 
set A. In order to check for known variable sources in our samples, 
we sp atially match our samples to the SIMBAD database dOenoval 
l2007h using a 6'.'0 search radius. For cluste r 3, the second object 
is the known variable star SV* BV 171 1 jStrohm eier & Knigg3 
I1975I) . The second object for cluster 4 is th e known infrared source 
IRAS 19225-0740 tHelou & Walkeil [19881) which may be a long- 
period late-type variable star. 

For the rest of the identified clusters in set A, we also ran- 
domly extract two example objects. These light curves are pre- 
sented in Figures [T5] 1 161 and[T7] Only a small number of objects 
among the examples are known variable stars or infrared sources 
that might be long-period variable stars. The clusters 10 and 13, 
corresponding to the second and third largest cluster respectively, 
show similarities in their light curves to those of the largest clus- 
ter (i.e. the cluster 7). Because of poor sampling for short-period 
variable sources in the NSVS, it is not likely for us to recognise 
periodic short variability in the example light curves. 



5 DISCUSSION AND CONCLUSION 

We presented a new framework for discovering variable objects in 
massive time-series data with variability indices which have been 
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Figure 14. An example of light curves for clusters 1, 2, 3, and 4. The sec- 
ond light curve for cluster 3 with Dm = 93.5 is ma tched to a known 
vaiiable star SV* BV 1711 (S trohmeier & Knigge 1975), while the second 
light curve for cluster 4 is IRAS 19225-0740 tHelou & Walker,, 1988ii . 

commonly used in astronomy. Our method is fully non-parametric 
and depends on only one assumption: the largest cluster represents 
a group of non-variable objects. The infinite GMM with the DP 
derives a mixture of multivariate Gaussian distributions from the 
given data consisting of six variability indices. With these results, 
we use the clustering results and Mahalanobis distances from the 
largest cluster to select variable object candidates. 

Our application of the infinite GMM with the DP for cluster- 
ing may be useful for measuring how efficiently we recover vari- 
able objects depending on various factors. Before designing the ob- 
servation strategy to acquire time-series data, simulated data can be 
applied to our method. This test will help people understand what 
kind of variability is missed in the data given by a specific observa- 
tion strategy and environment. 

Pre-processing the time-series data, i.e. feature extraction and 
selection, can affect the results of the infinite GMM with the DP. 
This effect has been seen in other unsupervised clustering algo- 
rithms, too jjain et at I fl99^ . In this paper, we only use six vari- 
ability indices which are mainly developed for astronomical time- 
series data. Unlike time-series data in other fields, astronomical 
time-series data are irregularly sampled and less homogeneous. 
This difference makes pre-processing of our data with a common 
method such as principal component analysis difficult. Moreover, 
finding the be st features for un supervised clustering is a trial-and- 
error problem ('lain et aljl999ll . In the framework of the GMM with 
the DP, the importance of each variability index is reflected in each 
Gaussian component's covariance matrix which also describes the 
compactness of the found clusters. Therefore, the combination of 
the best features will be dependent of the i nput data, while mak - 
ing this study be a trial-and-error problem tov & Brodlevl |2004h . 
In the next paper of this series, we will investigate the usefulness of 
a variety of variability indices for the infinite GMM with the DP. 

Finally, simply selecting variable star candidates with the un- 
supervised learning method is not useful without analysing what 
kind of objects are selected as candidates. Our approach also uses 
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Figure 15. Example light curves for clusters 5 - 12. None of these examples are known variable stars. Cluster 7 is the largest cluster which represents non- 
variable objects. Both light curves of cluster 6 show a recognisable change in brightness even with poor sampling of the light curves. Cluster 10 is the second 
largest cluster that has some objects within the 99% cut of Z)j\/ ~ 4.7 
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Figure 16. Example light curves for clusters 13 - 20. No known variable sources were found for these example objects within the 6'/0 search radius using the 
SIMBAD. But both examples of cluster 17 show long-period variability. Additionally, the examples of cluster 16 might be eclipsing binaries. Cluster 13 is the 
third largest cluster, and includes about 6% data of set A. 



unlabelled data which we do not know any physical properties 
about. It is necessary to figure out properties of the selected can- 
didates in the further analysis of variable time-series data. 

We showed that our method is a fully data-driven approach 
such that the method itself finds its best separation of variable ob- 
jects for the given data. This property makes our idea easily ap - 
plicabl e to future projects su ch as Pan-S TARRS dKaiseil l2004h . 
GAIA ( lEver&CuvD"CTsll2000h . and LSST l lWalkedl2003l) as well 
as archives of the past surveys such as MACHO JCook et al.ll995[) . 



In another paper, we will provide a full list of variable object can- 
didates from the NSVS for each observation field. 
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Figure 17. Example light curves for clusters 21 - 29. Except for clusters 28 and 29 which have only one member, two examples are presented for each cluster. 
The first example of cluster 21 spatially corresponds to IRAS 20302+1938 i Helou & Walker 1988). The second example with Dm = 7.22 x lO"^ is the 
infrai-ed carbon star IRAS 18364+1757 faelou & Waiker| |l988- Gugli elmo et aljl 19931) . While the first example of cluster 22 is the known variable star V* 
V2328 Cyg i Dahlmark 2001), the second example is n ot a known variable object even thou gh its light curve shows a clear sign of variability. For cluster 26, 
the first example is a Mira-type variable star V* Z Del iXempleton, Mattel, & Willsonl2005h . 
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APPENDIX A: BAYESIAN NON-PARAMETRIC 
CLUSTERING 

Bayesian nonparametric clustering algorithms based on the Dirich- 
let Process are a powerful way to model and manipulate data in 
statistics, machine learning, and signal processing. In this type of 
analysis, Bayesian refers to the manner in which one estimates the 
likelihood of an event given information in the data set about all 
known events and nonparametric refers to the manner in which a 
set of events can be modelled such that the structure of the model 
is determined only by the data set. Since Bayesian nonparametric 
techniques are not based on prior assumptions about the structure, 
number of mixture components, or location of components in a data 
set, one employs the Dirichlet Process to assign a prior probability 
(i.e., the unconditional probability of an event before relevant in- 
formation is considered) to a data point x„ such that the stochastic 
generative process draws from a distribution of distributi ons (in our 
case, a mixture of multivariate Ga ussian distributions) jpergusor] 
1 19731; lAntoniaklll974l: |jordanll2005l) . 

The Dirichlet Process mixtures are also referred to as infinite 
mixtures because although data may exhibit a finite number of com- 
ponents, new data can exh ibit previously unseen structure jNeall 
bOOd : iBlei & JordarJiiooi) . Therefore, these models adjust their 
complexity according to the co mplexity of the data and mitigate 
under-fitting the data ( lTehll2007 ). In this unsupervised algorithm, 
no data points were discarded as background. 

To understand the Dirichlet Process and our Bayesian non- 
parametric clustering algorithm, we explain the following ideas 
from probability theory. Let 77 be a probability space. Go be a distri- 
bution over Tj, and a be a positive real number (in our case, a = 1). 
Therefore, a random distribution G over 77 is said to be Dirichlet 
Process distributed: 

G~DP(q,Go), (Al) 
if and only if for all natural numbers j and any finite partition 



{Al, Aj) of T), the random vector [G{A\), . . . , G{Aj)) is dis- 
tributed as a finite-dimensional Dirichlet distribution: 



(G(Ai), . . . , G{Aj)) ~ Dir(QGo(Ai), . . . , aGo(^)), 



(A2) 



where Go is the base distribution of G (i.e., mean of the Dirichlet 
Process) and a is the concentration parameter (i.e., inverse variance 
of the Dirichlet Process) (Blei & Jordan 2004; Jordan 2005; Te3 
I2OO7I) . 

Bayesian nonparametric clustering based on the Dirichlet pro- 
cess can be applied to A'^-dim data with multiple parameter fields 
(si, . . . ,xn) provided that the data is regarded as being part of 
an indefinite exchangeable sequence. One models the distribution 
from which x is drawn as a mixture of distributions of the form 
F{r^), with the mixing distribution over 77 being G, which has the 
Dirichlet Process as a nonparametric prior probability. Therefore, 
the Dirichlet Process mixture model is represented as (Fergusonl 
(1973; Antoniak 1974; Ne al 2000.; .Blei & Jordan.2004 ; Teh 2 007): 




DP(a,Go), 

G, 



(A3) 
(A4) 
(A5) 



Since the parameters r] are drawn from G, the data x clusters 
according to the values of r). For the cluster model presented in 
this work, x is drawn from F, which is assumed to be a mixture of 
multivariate Gaussian distributions. Therefore, 77 
where /i,„ is the mean and Em is the covariance matrix for the tti*'' 
mixture component. 

In Dirichlet Process mixture modelling, the posterior distri- 
bution on the partitions (i.e., the conditional probability of the 
mixture components) is intractable to compute. However, Markov 
chain Monte Carlo methods allow one to approximate posteriors by 
constructing a Markov chain that is easy to implement for models 
based on conjugate prior di stributions s uch as the Gaussian distri- 
butions used in this paper ( lNealll2000l ; iBlei & Jordai]|2004h . The 
most widely used inference method is the Gibbs sampler because 
of its simplicity and good predictive performance. In the Gibbs 
sampler, the Markov chain is obtained by iteratively sampling each 
variable that is conditioned on the data and other previously sam- 
pled variables. If one integrates out all random variables except qm 
(i.e., mixture component that the n*'^ data point Xn is associated 
with), then one arrives at the collapsed Gibbs sampl er, which itera- 
tively draws each g™, from the following expression jBlei & JordanI 



tively c 



p(<7m = l\x,q^m,'^,a) p{x„\x-i,q^m,qm. = l,X)p{qm = l|g-m, q), (A6) 

where q-m denotes all of the previously sampled cluster variables 
except for the m"' variable and A is a hyper-parameter that is used 
to define the base distribution Go. The first term on the right-hand 
side of Equation |A6| is a combination of normalising constants that 
comes from considering Dirichlet Process mixtures for which data 
is drawn from an exponential family (e.g., Gaussian distribution). 
The second term on the right-hand side is: 



p((?m = l\q~m,a) 



N-l + a 



seen component 
unseen component 



(A7) 



iV-l + Q 

where 7i,„ is the number of members in ~ 1- Equation |A7| 
comes from the partition structure of the Dirichlet Process and is 
the heart of the algorithm's clustering effect such that the more 
frequently an event (i.e., a mixture component) is sampled in 
the p ast - the more likely the event is to be sampled in the fu- 
ture & JordanI 120041) . Once the Markov chain has run for 
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a sufficiently long duration, samples of q will be samples from 
p{q\x, a, A) and one can construct an empirical distribution to ap- 
proximate the posterior. 

The collapsed Gibbs sampler runs for a specified number of 
iterations and also iterates over the numb er of data items A'^. The 
method proceeds with the following steps ( lTehll2007l) : 

(i) Remove data item a;„ from component g™, where m speci- 
fies the cluster to which data item n belongs 

(ii) Delete the active component if it has become empty 

(iii) Compute conditional probabilities (pi, . . . , pm) with re- 
spect to data item Xn belonging to each of the A/ active compo- 
nents (gi, . . . , qm) 

(iv) Choose new component identity m by sampling from the 
conditional probabilities 

(v) If m = Af + 1, then create a new active component 

(vi) Add data item x„ into component 



